Abstract Visual features can help predict if a manipulation
behavior will succeed at a given location. For ex-
ample, the success of a behavior that flips light switches 
depends on the location of the switch. Within this pa- 
per, we present methods that enable a mobile manip- 
ulator to autonomously learn a function that takes an 
RGB image and a registered 3D point cloud as input 
and returns a 3D location at which a manipulation be- 
havior is likely to succeed. Given a pair of manipula- 
tion behaviors that can change the state of the world 
between two sets (e.g., light switch up and light switch 
down), classifiers that detect when each behavior has 
been successful, and an initial hint as to where one 
of the behaviors will be successful, the robot autono- 
mously trains a pair of support vector machine (SVM) 
classifiers by trying out the behaviors at locations in the 
world and observing the results. When an image feature 
vector associated with a 3D location is provided as in- 
put to one of the SVMs, the SVM predicts if the asso- 
ciated manipulation behavior will be successful at the 
3D location. To evaluate our approach, we performed 
experiments with a PR2 robot from Willow Garage in a 
simulated home using behaviors that flip a light switch, 
push a rocker-type light switch, and operate a drawer. 
By using active learning, the robot efficiently learned 
SVMs that enabled it to consistently succeed at these 
tasks. After training, the robot also continued to learn 
in order to adapt in the event of failure. 
Fig. 1 Left: Willow Garage PR2 operating a drawer, light
switch, and rocker switch using learned detectors that detect
regions where manipulation will succeed. Right: Results from
learned detectors during execution.

1 Introduction 

Informing robot manipulation with computer vision con- 
tinues to be a challenging problem in unstructured hu- 
man environments, such as homes. Two types of chal- 
lenges are particularly notable. First, the robot must 
handle wide variation in the appearance of task-relevant 
components of the world that can affect its ability to 
perform tasks successfully. Lighting can vary from home 
to home and from hour to hour due to indoor lighting 
and windows. In addition, important components, such 
as drawer handles and the drawer faces that serve as 
background, can be distinctive or even unique. The per- 
spective from which a mobile robot observes the com- 
ponent will also vary. 






Hai Nguyen, and Charles C. Kemp 



Second, the relationship between the appearance of 
task-relevant components and the success or failure of 
a manipulation behavior is complex. For example, the 
mechanics of a specific device may require that the 
robot act at a distinct location, such as a finicky drawer 
that needs to be pushed in the center to be closed, or 
a convoluted handle that the robot's gripper can only 
grasp at particular locations. The robot itself may also 
change over time and thus alter the relationship be- 
tween visual appearance and a manipulation behavior, 
as parts of its body settle, deform, and wear. 

One potential solution to these two problems is for 
robots to autonomously learn how specific objects re- 
spond to manipulation attempts using a behavior, and 
to continue to learn as they perform tasks. By using self- 
generated data, robots can learn direct mappings from 
visual features to the input parameters for behaviors, 
enabling robust execution despite errors in calibration, 
variations in robot pose, sensor noise, unexpected envi- 
ronmental interactions, and other factors. By continu- 
ing to learn over time, robots can also adapt to changes 
in the environment, the objects, and their bodies. 

In this work, we present a system that enables mo- 
bile manipulators to autonomously gather data about 
the execution of behaviors to improve their likelihood 
of success in future attempts. Our work advances au- 
tonomous robot learning in three ways. First, our re- 
search addresses challenges of learning in scenarios that 
integrate mobility and manipulation. During our tests, 
the robot navigates to the device from various places in 
the environment. Our approach uses a robot's mobil- 
ity as an integral part of autonomous learning, which 
enables the robot to handle the significant task varia- 
tion introduced by its mobility. Second, we show that 
autonomously learning to visually predict where a be- 
havior will be successful can be tractable in real-world 
scenarios. By using active learning, the robots in our 
tests learned each visual function after fewer than 150 
interactions with each device, even though the robot 
started from scratch and only used data it collected. 
The learned visual functions enabled the robots to suc- 
cessfully operate the devices and also have intuitive in- 
terpretations. Third, our methods autonomously learn 
to operate devices that have an approximately binary 
state, such as a light switch being up or down or a 
drawer being open or closed. This presents a challenge, 
since the robot's actions change the state of the world, 
which deters the robot from trying the same action 
again. For example, it would be difficult to learn to 
open a drawer if, once it is open, the robot is unable to 
close it. Our system addresses this difficulty by simul- 
taneously training pairs of behaviors and alternating 
between them as necessary. We also formalize the ideal
relationship between these pairs of behaviors and name
them complementary behaviors.

We evaluated our system using an implementation 
on a Willow Garage PR2 robot [2] at the Aware Home, 
which is a free-standing house at the Georgia Insti- 
tute of Technology constructed to test new technolo- 
gies. First, the robot autonomously learned to operate 
6 devices. After learning, we tested the robot's performance
in trials with each of the 6 devices for a total of
110 trials (110 trials = (5 devices * 2 behaviors * 10
trials) + (1 device * 2 behaviors * 5 trials)). In all 110
trials, the robot autonomously operated the device suc- 
cessfully after at most two attempts. If the first attempt 
failed, the robot autonomously detected the failure and 
then retrained using this new negative example prior to 
autonomously trying a second time. We tested opening 
and closing drawers, turning on and off light switches, 
and turning on and off rocker switches. Figure 1 shows
example output from the resulting trained classifiers, 
which classify image feature vectors as being associated 
with success or failure of a behavior. 

2 Related Work 

There is a significant body of work on robots learning to 
manipulate autonomously, robots learning to perceive, 
perception for manipulation that exploits task struc- 
ture, and active learning methods for using labeling ef- 
forts efficiently. In this section, we discuss the relation- 
ship between our work and current learning methods as 
well as work that has demonstrated the effectiveness of 
using task-relevant cues for perception in human environments.
The research we present in this paper builds
on our earlier workshop publication [47].



2.1 Robot Learning 

Even though the use of learning-based methods can 
yield powerful detectors, labeled training examples are 
often time consuming and expensive to obtain. Differ- 
ent robot learning methods such as imitation learning, 
interactive learning and developmental learning [37, 51]
can be grouped by how they approach the issue of gath- 
ering data. We now discuss different forms of robot 
learning methods and specifically work that involves 
autonomous learning, where the robot learns with little 
or no human input after an initialization or teaching 
period. 

2.1.1 Autonomously Learning to Act 

Developmental learning research primarily uses data 
from the robot's own interactions with its environment 
and emphasizes scenarios inspired by the development
of biological organisms [36]. Examples include studies
of gaze control [8, 9, 11], reaching [10, 43], pointing,
and poking [42].

In the manipulation literature there has also been 
interest in using autonomous learning based methods to 
find stable grasps. One of the earliest investigations in 
grasp learning is by Dunn and Segen [19], who matched
objects using visual features and learned candidate grasps
through trial and error. In [68], instead of learning
one grasping classifier the system learned separate clas- 
sifiers for grasp position and grasp orientation. Saxena 
et al. [56] presented a method that learned a grasp point 
classifier using a data set of simulated grasps and was 
able to show success in grasping real objects in uncluttered
environments. Similarly, other researchers
showed a grasp learning algorithm that maps from 3D
edge features using an active learning approach combined
with semi-supervised learning. The authors of [29]
used a supervised classifier to detect handles and distinguish
between left and right facing versions. Work in
[44] from the developmental learning community views
the same problem as one of learning object affordances
and proposed a method for estimating grasp densities
in images of objects on plain backgrounds.

Research on learning from demonstration has inves- 
tigated how to learn policies and controllers for manip- 
ulation but, in most cases, has not addressed perceptual 
challenges. We refer readers to [5] for an extensive survey
of existing methods.

While many projects focus on inferring policies from
human demonstrations [5], there is a subset of work
where autonomous learning is used with dynamic motion
primitives to refine initial human demonstrations.
Using this framework, prior work presents methods for
parameterizing motions refined from human demonstrations
based on task objectives. More recently, Pastor et
al. [50] implemented this framework on the PR2 robot
to show the PR2 flipping a box using chopsticks and
playing pool. In contrast, work in the domain of helicopter
acrobatic flights [64] used a combination of human
demonstrations and approximate models, instead
of real world practice, to extract intended trajectories
from sets of noisy demonstrations. In the system presented
by Prats et al. [53], the authors eschewed
trajectories and instead focused on creating a system
that records task properties in terms of forces applied
with respect to a user-defined visual reference frame.



2.1.2 Autonomously Learning to Perceive 



In contrast to motor learning, most work in learning
for perception relies on data captured manually [52],
captured in simulation [29, 56], or downloaded from the
web [3, 13]. Although large data sets can be collected
from these sources, the data generated can be biased
and may not match what the robot will encounter.
Likewise, the relationship between these data and the
robot's actions may not be clear. For example, a good
location for a person to grasp or a location that a person
believes would be good for grasping may not be
appropriate for a robot, given its end effector and other
differences. Additionally, there are challenges in processing
and generating such data. Text on the web is
also written in natural language for human beings [65].
Well-framed, aesthetically pleasing images can be misleading
for robots using noisy sensors from opportunistic points
of view. Accurate simulation of physical objects can also
be hard to obtain [4].

The system that we present uses data generated
from self-experience (similar to [20], [50], [55], and [30]).



With self-generated data, generalization becomes less 
of an issue as the training data and data encountered 
online are sampled from the same distribution. Even 
so, labeled examples can be costly to obtain. Inter- 
actions with the environment take time, and can po- 
tentially result in damage to the environment and the 
robot. Human labeling can be labor intensive and have 
errors, ambiguity, and inconsistencies [6]. We address
this issue in our work by combining active learning, 
which reduces the number of examples needed, with 
autonomous learning methods that eliminate the need 
for human labeling beyond an initialization process. 

Past work in learning for perceptual categorization, 
a process where agents learn through interaction with 
the world to divide sensory information into distinct 
groupings, has used robot self-generated data. However, 
most systems were designed to classify only simple ge- 
ometric objects such as cylinders and rectangles using 
cross-modal information [14, 15, 33].

A relatively small subset of work investigates more
complex objects found in human environments. In [61],
Stober et al. demonstrated an approach for extracting
spatial and geometric information from raw sensorimotor
data. Sukhoy and Stoytchev [63] presented an active
learning method for a robot pressing doorbell buttons.
Kraft et al. [32] presented a system that gradually
learns object representations and associates them with
object-specific grasps. Katz and Brock in [26] showed a
method with which a robot determines the structure of
articulated objects through experimentation. Similarly,
Hoof et al. [67] presented a system that selects maximally
informative actions to segment tabletop scenes.

Using a generative method with a planning framework,
Stulp et al. [62] introduced the concept of action-related
places, modeling how a robot's navigational uncertainty
affects its ability to execute grasps.



Paolini et al. in [49] present related work in which a
stationary robot estimates which parameters to provide
to a robot behavior based on experience. Like our ap- 
proach, the parameters to the behavior are in terms of a 
location. For them, the parameters to the behavior de- 
scribe the pose of a highlighter held in the robot's grip- 
per and the behavior attempts to stand the highlighter 
upright. Their approach uses estimated probability dis- 
tributions, including a distribution that gives the prob- 
ability of the marker's pose conditioned on haptic sens- 
ing. In contrast, we use a discriminative approach and 
directly map visual sensing to 3D locations that are 
likely to result in success of the behavior. 

The work of [63] is notable for its similarity to our 
approach. They presented a system that uses an uncer- 
tainty sampling scheme to actively learn the appearance 
of doorbell buttons. In contrast, our approach uses a 
different active learning algorithm, works with a mo- 
bile manipulator operating in situ devices, and handles 
persistent change to the state of the world. 

2.2 Task-Relevant Feature Detection

In a parallel thread to robot learning, there has been 
recognition in the mobile manipulation community of 
the importance of exploiting task structure to reduce 
the complexity of operating in the real world. This point
was argued by Katz et al. in [27]. Dang and Allen
[18] showed evidence that many manipulation tasks can
be described using sequences of rotations and translations.
Additionally, work in articulated object perception
[26], tool tip detection [28], door handle detection
[29], behavior-based grasping [23], and corner detection
for towel folding [38] suggests that, in many tasks, recovery
of complex representations of the state of objects
prior to manipulation is unnecessary. For example, Jain
and Kemp [24] demonstrated that overhead grasping of
diverse real-world objects can be successfully accomplished
by representing a segmented object as a planar
ellipse. In addition to detecting features used to parameterize
manipulation behaviors, task-specific cues can
be employed to verify the effects of a robot's actions.
For example, Okada et al. have demonstrated the value
of task-specific perception for success detection by humanoid
robots, including detecting that liquid is flowing.
These systems illustrate that low-dimensional
task-specific object representations can result in good
performance over real-world variation. However, they
used hand-coded and hand-trained feature detectors
that required significant engineering effort. With our
approach, robots autonomously learn to classify visual
features as being relevant to the success of a specific
behavior or not.



2.3 Active Learning and Curiosity Driven Learning 

In many robot learning scenarios, unlabeled data can
be readily acquired but labeling the data is costly. Researchers
have proposed the use of active learning methods
[59] to gain more value from limited labeling. In
many active learning algorithms, at each iterative learning
step the learner is given an option to select a data
point to be labeled out of a set of unlabeled data points.
For one class of proposed approaches, the learner picks
the data point whose label it is most uncertain about
[17, 35, 60]. With disagreement-based methods, learner
ensembles select the data point they most disagree on
[16]. More computationally demanding methods, however,
attempt to explicitly minimize future expected
error or variance [7, 34, 54]. There are also proposals
to combine semi-supervised and active learning to exploit
structure in unlabeled data [41, 45, 69]. Although
there have been several large scale studies of active
learning methods on different data sets showing their
superiority over randomly picking data points for labeling
[31, 57, 60], the best active learning algorithm to
use in each circumstance is application specific. In our
work, we use a heuristic that picks the data point closest
to the decision boundary of a support vector machine
(SVM) for labeling, a method that has been shown to
perform well in a variety of applications [25, 58, 66].
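The boundary-distance heuristic described above can be illustrated with a minimal sketch. It assumes a linear decision function with hand-picked weights standing in for a trained SVM; the names `decision_value` and `pick_closest_to_boundary`, the weights, and the feature vectors are all hypothetical and are not taken from the paper.

```python
def decision_value(w, b, x):
    """Signed score of a linear classifier: w . x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def pick_closest_to_boundary(w, b, unlabeled):
    """Return the index of the unlabeled vector whose score is nearest zero.

    For a fixed classifier, |score| orders points by their distance to the
    decision boundary, so the minimizer is the most uncertain point.
    """
    return min(range(len(unlabeled)),
               key=lambda i: abs(decision_value(w, b, unlabeled[i])))


# Hypothetical 2D feature vectors and classifier parameters.
weights, bias = [1.0, -1.0], 0.0
pool = [[3.0, 0.5], [1.0, 0.9], [0.2, 2.0]]
query_index = pick_closest_to_boundary(weights, bias, pool)
```

In this toy pool the scores are 2.5, 0.1, and -1.8, so the second vector would be queried next. A kernel SVM would replace `decision_value` with its own decision function; the selection step stays the same.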



3 Approach 

Our approach enables a mobile manipulator to autono- 
mously learn a function that takes a 2D RGB image 
and a registered 3D point cloud as input and returns a 
3D location at which a manipulation behavior is likely 
to succeed. Our approach requires a pair of manipula- 
tion behaviors, verification functions that detect when 
each behavior has been successful, and an initial hint 
as to where one of the behaviors will be successful. 

Each behavior must have input parameters that cor- 
respond with a 3D location that specifies where the be- 
havior will act. During training, our system executes 
each behavior multiple times using different 3D loca- 
tions around the device being manipulated and records 
whether or not the behavior succeeded at each loca- 
tion. For each 3D location, the system creates an im- 
age feature vector using an area of the registered 2D 
RGB image associated with the 3D location. These image
feature vectors are labeled with whether the
behavior succeeded or failed at their associated 3D locations.
In other words, the collected data set consists
of positive and negative examples of image feature vec- 
tors that were or were not associated with the success 
of the behavior. With a classifier trained from this data 
set, the robot can then predict if the associated behav- 
ior will succeed at a 3D location based on the image 
feature vector associated with the location. 

To avoid user intervention during training, our pro- 
cedure trains two behaviors at the same time, switching 
to the other behavior when the current behavior suc- 
ceeds. This enables our method to operate devices that 
can be approximated as having two binary states, such 
as a drawer being open or closed. Using a pair of be- 
haviors allows the robot to change the device back and 
forth between these two states, so that training can con- 
tinue autonomously. For example, instead of training a 
drawer opening behavior in isolation, our process flips 
to training a drawer closing behavior when opening suc- 
ceeds and vice versa until the classifier converges. We 
also formalize the relationship between the two behav- 
iors and define them as complementary behaviors. 
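The alternation between two behaviors can be sketched with a toy simulation. Everything here is an assumption made for illustration: the `Drawer` class, the one-dimensional action parameter `x`, the success tolerances, and uniform random sampling in place of active learning. The only idea taken from the text is that the loop switches to the other behavior whenever the current one succeeds.

```python
import random


class Drawer:
    """Toy two-state device used only for this sketch."""

    def __init__(self):
        self.is_open = False


def open_drawer(device, x):
    """Succeeds only when the drawer is closed and x is near the handle at 0."""
    if not device.is_open and abs(x) < 0.02:
        device.is_open = True
        return True
    return False


def close_drawer(device, x):
    """Succeeds when the drawer is open and x lands anywhere on its face."""
    if device.is_open and abs(x) < 0.10:
        device.is_open = False
        return True
    return False


def gather_examples(num_trials, seed=0):
    """Collect labeled (x, success) pairs, switching behaviors on success."""
    rng = random.Random(seed)
    device = Drawer()
    behaviors = [("open", open_drawer), ("close", close_drawer)]
    data = {"open": [], "close": []}
    current = 0
    for _ in range(num_trials):
        name, behavior = behaviors[current]
        x = rng.uniform(-0.05, 0.05)  # stand-in for choosing a 3D location
        success = behavior(device, x)
        data[name].append((x, success))
        if success:
            current = 1 - current  # train the other behavior next
    return data
```

Because a success always hands control to the other behavior, neither behavior is ever attempted from a device state in which it cannot succeed, which is what lets data collection continue without a person resetting the device.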

Using self-generated data takes considerable time, 
since each labeled image feature vector requires that 
the robot execute the behavior at a 3D location and 
observe the results. To avoid needing an intractable 
number of trials, our method uses active learning to 
execute the behavior at an informative 3D location at 
each iteration. Specifically, our procedure trains a sup- 
port vector machine (SVM) after each trial using the 
current labeled data. It then uses a heuristic proposed
by Schohn and Cohn [58] to select the unlabeled image
feature vector that is closest to the current SVM's decision
boundary to be labeled next. It then executes the
behavior at the 3D location associated with this image
feature vector.

Our training procedure has two phases. The first is
an initialization phase where the user selects the behavior
pair to train, gives a seed 3D location, and positions
the robot's mobile base for training. The next phase is
an autonomous phase where the SVM active learning
procedure runs until the learner converges. After convergence,
each behavior has a classifier that predicts 3D
locations where it will succeed.

During runtime, if the behavior's verification function
detects a failed attempt, our procedure appends
this negative example to the data set, retrains the classifier,
and tries again using the output of this new classifier
(Section 3.3.3).

In the following sections, we discuss the requirements
of our learning procedure (Section 3.1), properties of
complementary behaviors (Section 3.2), our training procedure
in detail (Section 3.3), and classification infrastructure
(Section 3.4).



3.1 Requirements 

For our algorithm to apply, several assumptions must 
be met. First, our approach assumes that the robot can 
execute a set of behaviors, each of which only requires 
a 3D location in the robot's frame of reference as initial 
input. We have demonstrated that this is a reasonable 
assumption for a variety of useful mobile manipulation 
behaviors in our previous work on laser pointer interfaces
[46].



Second, this approach assumes that the robot has a
way of reliably detecting whether a behavior it
has executed was successful. Our approach assumes
that the verification function, $V$, returns whether
or not a behavior succeeded. For this work, it takes the
form $V(I(b), I(a))$, where $I(x)$ is the array of robot sensor
readings when the state of the world is $x$. The states
$b$ and $a$ are the states before and after the robot executes
a behavior.

Third, the approach assumes that for each behavior,
B, there is a complementary behavior, B*. If B successfully
executes, then successful execution of B* will
return the world to a state that allows B to execute
again. We discuss the implications of this requirement
in Section 3.2.

Fourth, each behavior must return a 3D location 
indicating an approximate location where its comple- 
mentary behavior should execute. This requirement is 
provided as a device for the behavior pair to commu- 
nicate the position of the object manipulated to each 
other. These requirements can be summarized as: 



$\mathrm{behavior}(p^{3D}) \mapsto (\mathit{success},\ p^{3D})$ (1)
$\mathrm{behavior}^*(p^{3D}) \mapsto (\mathit{success},\ p^{3D})$ (2)

3.2 Complementary Behaviors



In order to train autonomously, without human inter- 
vention, our procedure uses a complementary pair of 
behaviors during its data gathering process. We intro- 
duce the notion of a complementary robot behavior B* 
to a behavior B as being a behavior that is capable of 
"reversing" the state of the world, so that behavior B 
can be used again. For example, if behavior B's function
is to turn off the lights using a light switch, its
complement, B* , would turn the lights back on using 
that light switch. If a behavior opens a door, then its 
complement would close the door. 
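The interface that a complementary pair must expose (a 3D location in, a success flag and a 3D location out, with success decided by a verification function over sensor readings taken before and after) can be sketched as follows. This is only an illustration under stated assumptions: the brightness-based verification functions, the `Behavior` dataclass, and the `FakeRoom` simulator are hypothetical and are not the verification functions or behaviors used on the robot.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Point3D = Tuple[float, float, float]
Image = List[List[float]]  # grayscale image as rows of pixel values


def mean_intensity(image: Image) -> float:
    pixels = [p for row in image for p in row]
    return sum(pixels) / len(pixels)


def lights_turned_on(before: Image, after: Image, min_change: float = 30.0) -> bool:
    """Hypothetical V(I(b), I(a)): success if the scene became much brighter."""
    return mean_intensity(after) - mean_intensity(before) >= min_change


def lights_turned_off(before: Image, after: Image, min_change: float = 30.0) -> bool:
    """Hypothetical V(I(b), I(a)): success if the scene became much darker."""
    return mean_intensity(before) - mean_intensity(after) >= min_change


@dataclass
class Behavior:
    act: Callable[[Point3D], Point3D]       # acts; returns a hint for the complement
    sense: Callable[[], Image]              # reads the sensors
    verify: Callable[[Image, Image], bool]  # compares before/after readings

    def __call__(self, p3d: Point3D) -> Tuple[bool, Point3D]:
        before = self.sense()
        hint = self.act(p3d)
        after = self.sense()
        return self.verify(before, after), hint


class FakeRoom:
    """Simulated room whose switch responds only within 1 cm of x = 0."""

    def __init__(self):
        self.lit = False

    def image(self) -> Image:
        value = 120.0 if self.lit else 10.0
        return [[value, value], [value, value]]

    def flip(self, p3d: Point3D, want_lit: bool) -> Point3D:
        if abs(p3d[0]) < 0.01:
            self.lit = want_lit
        return p3d


room = FakeRoom()
turn_on = Behavior(lambda p: room.flip(p, True), room.image, lights_turned_on)
turn_off = Behavior(lambda p: room.flip(p, False), room.image, lights_turned_off)
```

Calling `turn_on((0.0, 0.0, 0.0))` in this simulation reports success and returns a location that `turn_off` can use as its starting hint, mirroring the way the pair passes the device position back and forth.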








[Fig. 3 diagram labels: Training Loop for Behavior; Training Loop for Complementary Behavior; Feature Extractor; Active Learning for SVM; Complementary Behavior; 3D point]


Fig. 2 Relationships between sets $S$, $G$, $S^*$, $G^*$, $B$, and
$B^*$. Top: An example set of complementary behaviors where
$G^* \subseteq S$ and $G \subseteq S^*$. In this case, the effect of $B$ is reversible
using $B^*$. Bottom: An example set of behaviors that are not
complements with $G^* \not\subseteq S$, so $B^*$ can produce states that are
not in $S$.



We formalize our notion of complementary behav- 
iors by defining the relationship between ideal comple- 
mentary behaviors. We first define a hypothetical state 
space E that contains the states of everything in the 
world, including the robot's state. We then represent 
execution of behavior $B$ given an initial state of the
world $i \in E$ as $B(i)$, where $B$ is an operator that takes
the initial state of the world $i$ as input and returns the
resulting state of the world $r \in E$. Furthermore, when
$B$ is applied to a state $s \in S$, where $S \subseteq E$ is a set of
starting states, it returns $g \in G$, where $G \subseteq E$ is a set
of goal states. We define

$G = \{ g \mid V(I(i), I(g)) = \mathit{success} \wedge g = B(i) \wedge i \in E \}$ (3)

and

$S = \{ s \mid g \in G \wedge g = B(s) \wedge s \in E \}$. (4)

Intuitively, if the state of the world, $s$, is a start state,
$s \in S$, then the behavior $B$ will be successful and the
resulting state of the world, $g = B(s)$, will be a goal
state, $g \in G$.

We now define a complement $B^*$ of behavior $B$ to
have a set of start states, $S^*$, and a set of goal states,
$G^*$, such that $G^* \subseteq S$ and $G \subseteq S^*$ (see Figure 2).
This guarantees that applying $B$'s complement, $B^*$, after
successfully applying $B$ will result in a state of the
world that allows $B$ to once again be applied successfully.
More formally, it guarantees that $B^*(B(i)) \in S$
when $i \in S$, and that $B(B^*(i)) \in S^*$ when $i \in S^*$.
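The set definitions can be checked by brute force on a small finite state space. The sketch below is a toy under several assumptions: the three-state world, the behaviors `flip_down`, `flip_up`, and `smash`, and a sensor map that is simply the identity are all invented for illustration. It also uses a slightly stricter reading of the start-state definition, requiring that verification passes from the start state itself, so that a state already at the goal is not counted as a start state.

```python
E = {"up", "down", "broken"}  # toy state space (assumption)


def flip_down(state):
    return "down" if state == "up" else state


def flip_up(state):
    return "up" if state == "down" else state


def smash(state):
    return "broken"


def verify(target):
    """Build V(I(i), I(g)); the sensor map I is the identity in this sketch."""
    return lambda before, after: before != after and after == target


def goal_states(behavior, v):
    """States g = B(i) for which verification reports success."""
    return {behavior(i) for i in E if v(i, behavior(i))}


def start_states(behavior, v):
    """States s whose result is a goal state and passes verification from s."""
    goals = goal_states(behavior, v)
    return {s for s in E if behavior(s) in goals and v(s, behavior(s))}


def is_complement(b, v, b_star, v_star):
    """Check that G* is a subset of S and G is a subset of S*."""
    return (goal_states(b_star, v_star) <= start_states(b, v)
            and goal_states(b, v) <= start_states(b_star, v_star))
```

With these definitions, `flip_up` is a complement of `flip_down`, while `smash` is not, because its goal state lies outside the start states of `flip_down`.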



Fig. 3 Illustration of the classifier training procedure where 
the system trains the complementary behavior upon success 
of the first behavior and vice versa. Dashed orange boxes on 
the two behaviors and success detectors highlight that these 
modules are provided as input to our system. 



3.3 Autonomous Training 
3.3.1 Initialization 

Our initialization procedure is motivated by the sce- 
nario in which a user would take the robot on a home 
tour and point out 3D locations using a green laser 
pointer [46] and specify behaviors applicable to those 
locations. After this tour, the robot would later auto- 
nomously navigate back and learn to robustly perform 
the behaviors. 

For this paper, we have implemented an initializa- 
tion procedure that starts with the user navigating the 
robot to be in front of the device to be operated using 
a gamepad interface. Then using a green laser pointer 
[46], the user designates an initial 3D location to begin 
exploring. The robot samples 3D points around this 
designated location (using a spherical Gaussian with a 
variance of 4 cm) and executes the behavior pair with 
respect to them. After each execution of a behavior at a 
3D location, the behavior's verification function returns 
a label of either success or failure. The sampling pro- 
cess continues until the procedure gathers data points 
from at least one successful and one failed trial. These 
two data points are then used to train SVM classifiers 
that guide the data gathering process with the active
learning heuristic [58].
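The seed-sampling step of the initialization can be sketched as below. Several choices are assumptions rather than details from the text: the spread is passed as a standard deviation `sigma` in meters (the text states only "a variance of 4 cm"), `try_behavior` is a stand-in callable that replaces executing and verifying a real behavior, and `max_trials` is a safety cap added for the sketch.

```python
import random


def sample_until_both_labels(seed_point, try_behavior, sigma=0.04,
                             max_trials=100, rng=None):
    """Sample points around seed_point until one success and one failure exist.

    Returns a list of (point, success) pairs in the order they were tried.
    """
    rng = rng or random.Random(0)
    labeled = []
    for _ in range(max_trials):
        # Isotropic (spherical) Gaussian perturbation of the seed location.
        point = tuple(c + rng.gauss(0.0, sigma) for c in seed_point)
        labeled.append((point, try_behavior(point)))
        if {ok for _, ok in labeled} == {True, False}:
            break
    return labeled


def near_seed(point):
    """Hypothetical behavior outcome: success within 5 cm of the origin."""
    return sum(c * c for c in point) ** 0.5 < 0.05
```

A call such as `sample_until_both_labels((0.0, 0.0, 0.0), near_seed)` returns as soon as both labels have been observed, which is the minimum needed to fit a first two-class classifier.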



After this initialization, the robot stores a 2D mo- 
bile base pose with respect to a global map, the user 
provided 3D location, an SVM trained using two la- 
beled data points, and labels indicating which pair of 
behaviors is applicable at the specified location. We illustrate
this procedure in Figure 4. In addition, the user
navigates the robot to eight different poses in the room, 
referred to as practice poses, each at least a half me- 
ter away from the device. The robot also stores the 2D 
Fig. 4 Illustration of the initialization procedure for a pair of behaviors that flip light switches. Left: Position robot in front
of the switch. Middle: Illuminate an initial 3D location as input to the behavior using a laser pointer. Right: A 3D location
associated with success (green) and a 3D location associated with failure (red) after initialization. Unexplored 3D locations
are shown in white.




Fig. 5 Overview of the training procedure: initialization of 
the classifier; specification of practice poses in the environ- 
ment by the user; and a loop that navigates the robot to each 
practice pose and back to the device until the robot gathers 
enough training data. 



mobile base poses associated with these eight practice 
poses. 

3.3.2 Training Procedure 

Our training procedure is designed to emulate condi- 
tions that the robot would encounter when performing 
the task. After receiving a command, the robot nav- 
igates to the device so that it can execute the com- 
manded behavior. Navigation and localization errors re- 
sult in variations that can substantially reduce the per- 
formance of a behavior, such as variation in the robot's 
point of view. We illustrate task variation due to navigation
in Figure 6. Our training method samples from
this source of task variation by commanding the robot 
to navigate to one of eight practice poses in the room 
and then commanding it to navigate back to the device 
(see Figure [5]). 

After navigating to the device, our procedure begins
an active learning phase (see Figure 3). We summarize
this phase in Algorithm 1. The process starts
with the robot capturing an RGB image and a regis- 
tered 3D point cloud. The robot then computes image 
feature vectors for 3D points randomly sampled from 




Fig. 6 This figure shows a visualization of task variation due 
to the robot's mobility. We affixed a red dot at the center of a 
rocker switch. The robot attempted to navigate to the same 
pose and take the same picture of the switch 10 times. This 
image superimposes the red dot from 9 images onto the first 
image to illustrate the wide variation due to navigation. One 
of the 10 dots is obscured by 2 others. The switch plate shown 
has a width of 7.0 cm. If the robot were to use its localization 
estimate to press this switch, most of the attempts would 
result in failure. 



the point cloud around the device (extract_features). It
then iteratively selects image feature vectors (svm_pick)
that it labels by executing the behavior at the associated
3D location and using the verification function
(execute_behavior). After each trial, the process retrains
the SVM classifier with a data set that incorporates the
newly acquired example (add_instance_and_retrain_svm).
The procedure stops after gathering a maximum of six
labeled image feature vectors or when the learner converges
(stop_criteria). We imposed this conservative maximum
limit because image feature vectors gathered from the 
same view are correlated, which can confuse the learn- 
ing heuristic and result in the training process stopping 
prematurely. If the robot operates the device success- 
fully in a trial, the algorithm continues, but uses the 

Algorithm 1: practice(point^3D, behavior, comp_behavior, stop_criteria)

  instances, candidates^3D = extract_features(point^3D);
  while True do
    instance, candidate^3D = svm_pick(behavior, instances, candidates^3D);
    if stop_criteria(behavior) or svm_converged(behavior, instances) then
      break;
    end
    success, candidate^3D* = execute_behavior(behavior, candidate^3D);
    add_instance_and_retrain_svm(instance, success);
    instances = instances \ instance;
    candidates^3D = candidates^3D \ candidate^3D;
    if success then
      practice(candidate^3D*, comp_behavior, None,
               stop_criteria=stop_on_first_success);
    end
  end



complementary behavior for the next iteration (the branch
following "if success" in Algorithm 1).

This process continues until svm_converged is satisfied
for each of the eight practice poses. Once it is satisfied
for a particular practice pose, the robot no longer
navigates to the pose. We define convergence for a practice
pose to occur when, after driving up to the device
from the practice pose, none of the initially computed
image feature vectors are closer to the decision boundary
than the current support vectors.
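
The control flow of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation that ran on the robot: the `env` object and its five methods are hypothetical stand-ins for the procedures named in Algorithm 1, the stop criteria are modeled as functions of the labels gathered so far, and the `while instances` guard and the `comp_behavior is not None` check are additions that keep the sketch from running past its inputs.

```python
from typing import Any, Callable, List, Optional


def stop_after(limit: int) -> Callable[[List[bool]], bool]:
    """Stop once `limit` labeled examples have been gathered in this call."""
    return lambda labels: len(labels) >= limit


def stop_on_first_success(labels: List[bool]) -> bool:
    """Stop as soon as one trial in this call has succeeded."""
    return any(labels)


def practice(point: Any, behavior: str, comp_behavior: Optional[str],
             stop_criteria: Callable[[List[bool]], bool], env: Any) -> List[bool]:
    """Try `behavior` at selected candidates, retraining after every trial."""
    instances, candidates = env.extract_features(point)
    instances, candidates = list(instances), list(candidates)
    labels: List[bool] = []  # outcomes observed during this call
    while instances:  # added guard: also stop when no candidates remain
        k = env.svm_pick(behavior, instances)  # index of the chosen instance
        if stop_criteria(labels) or env.svm_converged(behavior, instances):
            break
        # Remove the chosen instance and its 3D candidate from the pools.
        instance, candidate = instances.pop(k), candidates.pop(k)
        success, end_point = env.execute_behavior(behavior, candidate)
        env.add_instance_and_retrain_svm(behavior, instance, success)
        labels.append(success)
        if success and comp_behavior is not None:
            # Undo the state change with the complementary behavior, which
            # gathers its own labels until its first success.
            practice(end_point, comp_behavior, None, stop_on_first_success, env)
    return labels
```

A call such as `practice(hint, "switch_on", "switch_off", stop_after(6), env)` mirrors the limit of six labeled image feature vectors per view described above; the behavior names are placeholders.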

3.3.3 Behavior Execution Procedure 

The training process above produces a classifier that
can reliably detect locations where the associated behavior
will succeed. To use this classifier, our robot navigates
to the device using the 2D map pose stored during
initialization, classifies 3D points in the view that it
sees, finds the mode of the positively classified points
using kernel density estimation [1], selects the 3D point in
the point cloud closest to this mode, and executes the
associated behavior using the resulting 3D location.
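
The location selection step can be illustrated with a small Python sketch. It rests on assumptions that the text does not specify: the density mode is approximated by evaluating an unnormalized Gaussian kernel density estimate at each positively classified point, and the bandwidth of 0.02 (in the same units as the points) is an arbitrary placeholder.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def kde_density(query: Point, samples: Sequence[Point], bandwidth: float) -> float:
    """Unnormalized Gaussian kernel density estimate at `query`."""
    return sum(math.exp(-math.dist(query, s) ** 2 / (2.0 * bandwidth ** 2))
               for s in samples)


def select_location(cloud: Sequence[Point], positives: Sequence[Point],
                    bandwidth: float = 0.02) -> Point:
    """Return the cloud point nearest the densest positively classified point."""
    if not positives:
        raise ValueError("no positively classified points to choose from")
    # Approximate the mode by the positive point with the highest density.
    mode = max(positives, key=lambda p: kde_density(p, positives, bandwidth))
    # Snap the mode to the closest measured 3D point.
    return min(cloud, key=lambda p: math.dist(p, mode))
```

Evaluating the density only at the positive points avoids a separate mode-seeking step, at the cost of restricting the mode to observed locations.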

If the behavior fails to execute using this 3D location,
our procedure adds the associated image feature
vector as a negative example to the data set and retrains
the classifier. This new example changes the classifier's
decision boundary. The robot then selects a new 3D location
using the retrained classifier with the originally
computed image feature vectors. This continues until
the behavior is successful. It then adds the image feature
vector associated with this success to the data set
as a positive example and retrains the SVM. In contrast
to systems where the execution process is independent
of data gathering and training, the robot has the opportunity
to retrain its classifier when it detects errors
made during execution, giving the possibility of lifelong
training.
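
The retry procedure described above can be sketched as a short loop. The three callables are hypothetical placeholders for the classifier-based selection, the behavior with its verification function, and the retraining step; the `max_tries` cap is an added safeguard that the text does not mention, since the described procedure continues until success.

```python
from typing import Any, Callable, Optional, Sequence, Tuple


def execute_until_success(
    features: Sequence[Any],
    locations: Sequence[Any],
    select: Callable[[Sequence[Any]], int],
    attempt: Callable[[Any], bool],
    add_example_and_retrain: Callable[[Any, bool], None],
    max_tries: int = 5,
) -> Tuple[Optional[Any], int]:
    """Retry a behavior, feeding every outcome back into the classifier."""
    tries = 0
    while tries < max_tries:
        # The feature vectors are computed once; only the classifier changes.
        k = select(features)
        tries += 1
        success = attempt(locations[k])
        # Failures become negative examples, the final success a positive one.
        add_example_and_retrain(features[k], success)
        if success:
            return locations[k], tries
    return None, tries
```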

3.4 Classification 

The base classifier that we use in this work is a support
vector machine (SVM). As is standard in supervised
classification, given a data set of labeled examples
$D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ with $x_i$ representing
the real-valued feature vector $i$ of local 2D appearance
information associated with a candidate 3D point and
$y_i \in \{1, -1\}$, where positive and negative denote,
respectively, success and failure, we want to be able to
predict $y_j$ for a future instance $(x_j, y_j) \notin D$.

As functional structures on many household devices 
are often small compared to nonfunctional components, 
such as the size of a switch relative to the plate or wall, 
there is typically an unbalanced data set problem, since 
there can be many more negative than positive exam- 
ples. In unbalanced data sets the SVM can return trivial 
solutions that misclassify all the positive samples, since 
the misclassification cost term in the SVM objective is 
defined over all samples. To prevent this issue, we use 
an SVM formulation that separates the costs of misclas- 
sifying the negative class from the cost of misclassifying 
the positive class [12], 

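\[
\begin{aligned}
\min_{w,\,b,\,\xi}\quad & \tfrac{1}{2} w^T w + C^{+} \sum_{i:\,y_i = 1} \xi_i + C^{-} \sum_{i:\,y_i = -1} \xi_i \\
\text{s.t.}\quad & y_i \left( w^T \phi(x_i) + b \right) \ge 1 - \xi_i, \\
& \xi_i \ge 0,\quad i = 1, \ldots, n,
\end{aligned}
\tag{5}
\]

where $w$ and $b$ are SVM parameters, $\xi_i$ counts the margin
violations for misclassified points (in the case of
nonseparable data), and $\phi()$ is the radial basis kernel
function we use (discussed in Section 4.1).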

This formulation separates the SVM misclassification
cost scalar $C$ into $C^{+}$ and $C^{-}$, which are, respectively,
costs due to negative and positive misclassifications.
For our system, we set $C^{-}$ to be 1, and $C^{+}$ to
be the number of negative examples over the number of
positive examples. This scaling keeps the percentage of
misclassified positive and negative examples similar in
our skewed data set, where there might be many more
negative than positive examples. Without this adjustment,
we found that training often returned trivial classifiers
that classified any input vector as negative.
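
As a worked example of this cost setting, the sketch below computes the two costs from a list of labels. The function name and the dictionary return type are illustrative choices, not part of the described system.

```python
from typing import Dict, Sequence


def class_costs(labels: Sequence[int]) -> Dict[int, float]:
    """Return per-class costs {+1: C_plus, -1: C_minus} for labels in {1, -1}."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = sum(1 for y in labels if y == -1)
    if n_pos == 0:
        raise ValueError("at least one positive example is required")
    # C_minus is fixed at 1; C_plus is the negative-to-positive ratio.
    return {1: n_neg / n_pos, -1: 1.0}
```

For a data set with 49 positive and 96 negative examples (the HRL Rocker On counts in Table 1), this gives a positive-class cost of 96/49, roughly 1.96. Libraries that expose per-class weights, such as the `class_weight` parameter of scikit-learn's `SVC` or the `-wi` option of LIBSVM, accept values of this kind as multipliers on a shared C.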

3.4.1 Active Learning Heuristic

Our training process iteratively builds a data set that
it uses to train the classifier. Before each trial, the system
selects the image feature vector to label. To select
the feature vector, the system uses a heuristic developed
in [5^ that selects the feature vector closest to
the decision boundary of the existing SVM, under the
condition that it is closer to the boundary than the
SVM's support vectors. The procedure converges when
no feature vectors remain that are closer to the decision
boundary than the support vectors.
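
The selection rule and its convergence test can be illustrated in a few lines of Python. The sketch assumes the caller has already computed each vector's distance to the decision boundary (for example, the absolute value of the SVM decision function); the function name and the use of `None` to signal convergence are illustrative choices.

```python
from typing import Optional, Sequence


def pick_query(unlabeled_dists: Sequence[float],
               support_dists: Sequence[float]) -> Optional[int]:
    """Index of the unlabeled vector to label next, or None if converged.

    Both arguments hold non-negative distances to the decision boundary.
    `support_dists` must be non-empty (a trained SVM has support vectors).
    """
    if not unlabeled_dists:
        return None
    closest_support = min(support_dists)
    best = min(range(len(unlabeled_dists)), key=unlabeled_dists.__getitem__)
    # Only query a vector that is closer to the boundary than every
    # support vector; otherwise the learner has converged.
    if unlabeled_dists[best] >= closest_support:
        return None
    return best
```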

At each iteration $i$ of our procedure, we define the
previous iteration's data set as $D_{i-1}$, the current set
of support vectors as $X^{sv} = \{x^{sv}_1, \ldots, x^{sv}_P\}$, the
unlabeled image feature vectors as $X^{q} = \{x^{q}_1, \ldots, x^{q}_N\}$, and
the SVM distance function, which measures distance to
the decision boundary, as $d(x_i) = |w^T \phi(x_i) + b|$. The
system selects the unlabeled image feature vector that
is closest to the decision boundary as specified by the
following expression:

\[
\operatorname*{argmin}_{x^{q}_i \,:\, \forall x^{sv}_j,\; d(x^{q}_i) < d(x^{sv}_j)} d(x^{q}_i)
\tag{6}
\]

3.4.2 Features



The feature generation procedure, which is illustrated
in Figure 7, takes as input a 3D point cloud, a registered
high resolution RGB image, and a reference 3D
point. The system first selects random 3D points from
the point cloud, without replacement, around the reference
3D point according to a Gaussian distribution
$\mathcal{N}(p, \Sigma)$, where $\Sigma = \mathrm{diag}(v_x, v_y, v_z)$ with $v_x$, $v_y$, and
$v_z$ being, respectively, variances in the x, y, and z
directions. The Gaussian mean $p$ is set to the 3D reference
point. This Gaussian search prior enables the system to
save computational effort and focus its attention on the
device that the robot is supposed to manipulate.
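
One way to realize this sampling step is sketched below. The text does not say how the Gaussian prior is applied to a discrete point cloud, so the sketch makes an assumption: each cloud point is weighted by the axis-aligned Gaussian density at its position, and points are then drawn one at a time without replacement in proportion to those weights. The function and parameter names are illustrative.

```python
import math
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def sample_near(cloud: Sequence[Point], ref: Point, variances: Point,
                k: int, rng: random.Random) -> List[Point]:
    """Draw up to k distinct cloud points, favoring those near `ref`."""
    # Unnormalized diagonal-covariance Gaussian weight for every point.
    weights = [
        math.exp(-0.5 * sum((c - m) ** 2 / v
                            for c, m, v in zip(p, ref, variances)))
        for p in cloud
    ]
    pool = list(range(len(cloud)))
    chosen: List[Point] = []
    while pool and len(chosen) < k:
        pool_weights = [weights[i] for i in pool]
        if sum(pool_weights) <= 0.0:
            break  # remaining points are too far away to carry any weight
        j = rng.choices(range(len(pool)), weights=pool_weights)[0]
        chosen.append(cloud[pool.pop(j)])  # pop enforces "without replacement"
    return chosen
```

Passing a seeded `random.Random` instance makes a run repeatable, which helps when comparing classifiers trained on the same view.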

After randomly selecting a set of 3D points, the system
projects each 3D point $p_i$ into the high resolution
RGB image, $\mathrm{proj}(p_i)$. For each projected 3D point, it
collects square image patches of successively increasing
size centered at the projected 2D point in the RGB
image, scales these patches to have the same height
and width, vectorizes them, and concatenates them into
an image feature vector. The system then uses Principal
Component Analysis (PCA) to reduce the dimensionality
of these image feature vectors. We discuss the
specifics of these steps in Section 4.1.



of sensors. Our system uses 3D point clouds and 5
megapixel RGB images from the robot's tilting laser
range finder and Prosilica camera.

Starting with a 3D point cloud and registered RGB
image, our process randomly selects 3D points from the
point cloud as described in Section 3.4.2. For each selected
3D point, the system collects image patches at
4 scales centered around the point's 2D projection in
the RGB image. The raw image patches have widths of
41, 81, 161, and 321 pixels. They are then scaled down
to be 31x31 pixel image patches, vectorized, and concatenated
into an 11,532 element image feature vector
for each 3D point. The vectors are then reduced to 50
element vectors by projecting them onto PCA basis vectors
that are calculated for each action using the 11,532
element image feature vectors computed from the first
3D point cloud and RGB image captured during initialization.
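
The 11,532 figure can be checked with a short calculation. The sketch assumes three color channels per pixel, which is not stated explicitly but is the value that makes the arithmetic agree (4 x 31 x 31 x 3 = 11,532); `patch_bounds` is an illustrative helper showing how an odd-width patch is centered on a projected pixel.

```python
from typing import Sequence, Tuple


def feature_length(patch_widths: Sequence[int], scaled_width: int,
                   channels: int) -> int:
    """Length of the concatenated vector of rescaled square patches."""
    return len(patch_widths) * scaled_width * scaled_width * channels


def patch_bounds(u: int, v: int, width: int) -> Tuple[int, int, int, int]:
    """Half-open pixel bounds (left, top, right, bottom) of a patch
    of odd `width` centered at pixel (u, v)."""
    half = (width - 1) // 2
    return (u - half, v - half, u + half + 1, v + half + 1)


# Four raw patch widths, each rescaled to 31x31, assuming 3 channels.
n_elements = feature_length([41, 81, 161, 321], 31, 3)
```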

To classify these 50 dimensional image feature vectors,
we use SVMs with radial basis function kernels.
We set the hyperparameters of this kernel using an
artificially labeled data set. To create the data set we
took 10 different 3D point clouds and RGB images of a
light switch from different views and geometrically
registered them. After hand-labeling one 3D point cloud
and RGB image, we geometrically propagated labels to
the other 9. To find the kernel hyperparameters, we
split the labeled image feature vectors from this data
set into a training set and a test set. We then performed
a grid search [12] for the set of hyperparameters that
best generalized to unseen data in the test set.



4.2 Behaviors 

To evaluate our system, we implemented three pairs of 
complementary behaviors that operate light switches, 
rocker switches and drawers. These tasks are sensitive 
to the location at which an action is performed. For 
example, light switches are small targets that require 
high precision and accuracy for the PR2 to operate with 
its finger tips. As illustrated in Figure 6, we have found
that a PR2 will rarely succeed at flipping a light switch 
if it simply navigates to a pre-recorded location and 
moves the arm through a pre-recorded motion without 
visual feedback. 



4 Implementation 



4.1 Learner Parameters 

We implemented our system on a PR2 robot [2^: a
mobile manipulator produced by Willow Garage with 
two arms, an omnidirectional base, and a large suite 



4.3 Light Switch Behaviors 

Our light switch behavior's strategy is to reach forward 
to the specified 3D location, stop on contact detected 
with gripper tip tactile sensors, then slide along the 
[Figure 7 panel labels: Generate an Image Feature Vector; Collect Image Patches Around Projected 3D Location; Scale and Vectorize the Image Patches (4 scales); Reduce Dimensionality of the Image Vector Using PCA; Classify the Image Feature Vectors Using an SVM; Select a 3D Location Using Kernel Density Estimation]

Fig. 7 To select a 3D location at which the behavior is likely to be successful, the system first generates image feature vectors 
for a set of 3D locations. It does so by vectorizing and then reducing the dimensionality of scaled image patches centered 
around the 2D projection of each 3D location. Then it uses an autonomously trained SVM to classify each of these image 
feature vectors as predicting success (blue) or failure (orange) of the behavior. Finally, it selects a specific 3D location using 
kernel density estimation. 
[Figure 8 panel label: Move Gripper Forward (50 cm)]



Fig. 8 Sequence of actions performed by each of the eight 
behaviors used in this work for operating a light switch, rocker 
switch and drawer. Dotted orange boxes indicate procedures 
for detecting success or failure in a given behavior. 



contacted surface in the direction of the switch. A successful
3D location needs to place the robot's finger so
that its width will make contact with the switch and
far enough above or below the switch so that the finger
will move the switch down or up. Figure 8 shows the
sequence of actions taken by this behavior.

The behavior starts with the robot closing its gripper
(Close Gripper), moving the gripper to a pre-manipulation
location (Move to Start Location), reaching
to the given 3D location (Reach), flipping the switch by
sliding along the flat surface (Flip Switch), moving the
gripper back (Move Gripper Back), then moving back
to the initial location (Move to Start Location).

There are a few steps in this behavior where the
robot detects tactile events. When reaching, the robot
stops when it detects contact using pressure sensors on
its finger tips. Next, the sliding movement stops after
detecting a spike in acceleration with the accelerometer
embedded in the robot's gripper. In the context
of this task, this spike in acceleration typically corresponds
with the light switch flipping.

To detect success, our behavior measures the differ- 
ence between the average intensity of an image captured 
before sliding along the surface and an image captured 
after. A large difference indicates that the lighting in- 
tensity changed. 
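
The verification step can be illustrated with a minimal sketch that compares mean intensities. Images are represented here as nested lists of grayscale values, and the threshold of 30 intensity levels is a made-up placeholder; the text only says that a large difference indicates a lighting change.

```python
from typing import Sequence

Image = Sequence[Sequence[float]]  # rows of grayscale pixel intensities


def mean_intensity(image: Image) -> float:
    """Average pixel intensity over the whole image."""
    n_pixels = sum(len(row) for row in image)
    return sum(sum(row) for row in image) / n_pixels


def lighting_changed(before: Image, after: Image,
                     threshold: float = 30.0) -> bool:
    """True when the mean intensity shifts by more than `threshold`."""
    return abs(mean_intensity(after) - mean_intensity(before)) > threshold
```

Using the absolute difference lets the same check serve both the behavior that turns the light on and the complementary behavior that turns it off.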

The complementary behavior is identical except for 
a change in the direction of flipping. After executing, 
the behavior and complementary behavior return the 
3D location input with a predefined offset (± 8 cm). 

4.4 Rocker Switch Behaviors 

Our rocker switch behavior consists solely of a reach- 
ing out step similar to the light switch behavior above, 
since the force applied from contact during the reach 
procedure is enough to activate the switch. A success- 
ful 3D location will result in the robot's fingers pushing 
in the top or bottom of the rocker switch. 

This behavior uses the same image differencing me- 
thod to detect success as the light switch behavior. It 
tests in the Georgia Tech Aware Home, a residential lab
on campus used as a test bed for new technologies. 

In each environment, we began our evaluation by
creating an occupancy grid map of the area with the
PR2's built-in navigation package [39]. Then, after ini-



[Figure 9 axis label: position (meters)]



Fig. 9 Results of experiments for which we used a motion 
capture system to track the robot's pose while navigating 
between two goal poses (blue and red). Green is the path the 
robot took. Stars indicate the final poses of the robot after 
it navigated to the goal poses. Circles show a point 50 cm in 
front of the robot. 



calculates the difference between images captured be- 
fore and after the robot reaches forward. After execut- 
ing, the behavior and complementary behavior return 
the 3D location with a predefined offset (± 5 cm). 



4.5 Drawer Behaviors 

Pulling open and pushing closed a drawer require different
behaviors and success detection methods. Our
pulling behavior reaches to the drawer handle location,
detects contact, moves back slightly, grasps with the
reactive grasper from [21], and pulls. When pulling, failure
is detected if the grasp fails or the robot fails to pull
for at least 10 cm while in contact with the handle. A
successful 3D location will result in the robot's gripper
grasping the handle well enough to pull it back by at
least 10 cm. When pushing, failure is detected if the
gripper does not remain in contact with the surface for
at least 10 cm. This classifies events where the robot
pushes against a closed drawer or an immovable part of
the environment as failures. After executing, the behavior
and complementary behavior return the 3D location
the tip of the gripper was in immediately after pulling
or pushing.



5 Evaluation 

We evaluated our system using six separate devices. 
We first tested on a rocker switch using the PR2 robot 
named GATSBII in our lab, the Healthcare Robotics 
Lab (HRL). For the remaining five devices we performed



tialization (Section 3.3.1), we ran the autonomous training
system (Section 3.3.2) until convergence. The experimenter
provided 8 practice poses in the room representative
of places from which the robot might travel. The
autonomous training system ran without experimenter
intervention except for pausing and resuming when the
robot's batteries ran low. In all, we trained 12 classifiers,
a result of having 6 devices and a pair of behaviors
for each device (12 = 6 * 2).

After finishing the training sessions, we evaluated
each classifier by running each behavior multiple times,
giving 110 trials in all (110 trials = (5 devices * 2 behaviors
* 10 trials) + (1 device * 2 behaviors * 5 trials)).
During each trial we allowed the behavior to retry
and incorporate information from failures if it did not
succeed the first time. However, we discarded any data
gathered during the retry procedure by previous trials
at the start of each new trial to obtain accurate error
statistics for the original classifier.

For the devices we used in our evaluation, the func- 
tional components are difficult for the PR2's laser range 
finder to detect. Light switches only show up as a few 
protruding 3D points similar to other noisy 3D points 
produced by the sensor. The rocker switch appears as 
a flat 2D texture on the 3D point cloud. Drawer han- 
dles tend to be metallic and reflective resulting in an 
absence of 3D points. Using features from RGB images 
enabled the robot to overcome these challenges. 

5.0.1 Effects of Navigation Errors

To better understand the variation in the task due to
the robot's mobility, we investigated how the pose of
the PR2 varies when navigating to a goal pose. Using a
room equipped with a NaturalPoint OptiTrack motion
capture system, we tracked the pose of the PR2 and
commanded the robot to navigate back and forth to
two goal poses 10 times each. As the standard deviation
of the robot's Cartesian position does not represent
angular errors, we calculated errors for a point 50 cm
in front of the robot, which is representative of where
a device would be located. For the first goal pose, the
standard deviation of the location in front of the robot
was 1.85 cm and 1.79 cm in the x and y directions,
respectively. For the second goal pose, the standard
deviation was 1.55 cm and 2.38 cm in the x and y
directions, respectively. We show the results of this
experiment in Figure 9. These errors demonstrate that
navigating to a pre-recorded
Table 1 Training examples (abbreviated Ex.) gathered
for each action.

Action Positive Ex. Negative Ex. Total



Table 2 For each trained behavior we ran 10 trials. We list
the number of tries until success for these trials below.

Action                    1st Try   2nd Try
HRL Rocker On                   2         3
HRL Rocker Off                  4         1
Aware Home Rocker On           10         -
Aware Home Rocker Off           9         1
Ikea Drawer Open               10         -
Ikea Drawer Close              10         -
Brown Drawer Open              10         -
Brown Drawer Close             10         -
Orange Switch On                8         2
Orange Switch Off               9         1
Ornate Switch On                9         1
Ornate Switch Off               9         1

location and moving the arm through a pre-recorded 
motion would result in large variation that can result 
in failure. For example, the robot's finger tips are 2.0 
cm wide and light switches are only 0.8 cm wide. 
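
The error measure used above, the spread of a point 50 cm in front of the robot, can be computed from logged poses as in the following sketch. The pose format `(x, y, theta)` with theta in radians and the use of the sample standard deviation are assumptions; the text does not state which estimator was used.

```python
import math
import statistics
from typing import Sequence, Tuple

Pose = Tuple[float, float, float]  # x (m), y (m), heading theta (rad)


def point_ahead(x: float, y: float, theta: float,
                distance: float = 0.5) -> Tuple[float, float]:
    """Position of a point `distance` meters in front of the robot."""
    return (x + distance * math.cos(theta), y + distance * math.sin(theta))


def ahead_std(poses: Sequence[Pose],
              distance: float = 0.5) -> Tuple[float, float]:
    """Sample standard deviation (x, y) of the point ahead of each pose."""
    points = [point_ahead(x, y, theta, distance) for x, y, theta in poses]
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return statistics.stdev(xs), statistics.stdev(ys)
```

Because the point is offset along the heading, a small rotation error alone moves it sideways: a heading error of 0.04 rad shifts the point by about 2 cm, which matches the 2.0 cm finger-tip width quoted above.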



ers, the SVM estimates that success can be achieved by
pulling on the top handle or the bottom handle. This
illustrates a limitation of our current approach, since
the verification function for pulling a drawer open cannot
tell the difference between the top and the bottom
drawer. It also shows the influence of the distribution
used to randomly sample 3D locations. At the same
time, it suggests that the visual classifiers may have
some ability to generalize to distinct objects.

For the light switches, the behaviors slide along the 
surface of the switch. The robot autonomously discov- 
ered that locations that are along the switch plate above 
and below the switch are likely to lead to success. Addi- 
tionally, it does not predict success for locations along 
the wall, which is appropriate since the robot's fingers 
get caught on the switch plate edge if the robot tries to 
slide along the wall to the switch. 

In Table 1, we show the number of examples collected
for each classifier. The median number of examples
needed was 77, and the maximum needed was 145
examples. With the rocker switch, where examples are
noisy due to the middle of the switch being an unreliable
spot to push, the number of examples increased
to 145, indicating a sensitivity of our approach to label
noise.
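
The summary statistics quoted above can be reproduced from the Total column of Table 1. The short script below does so; the dictionary keys simply restate the action names from the table.

```python
import statistics

# Total examples gathered per classifier, copied from Table 1.
totals = {
    "HRL Rocker On": 145, "HRL Rocker Off": 141,
    "Aware H. Rocker On": 73, "Aware H. Rocker Off": 81,
    "Ikea Drawer Open": 58, "Ikea Drawer Close": 62,
    "Brown Drawer Open": 83, "Brown Drawer Close": 71,
    "Orange Switch On": 60, "Orange Switch Off": 51,
    "Ornate Switch On": 104, "Ornate Switch Off": 116,
}

median_examples = statistics.median(totals.values())
max_examples = max(totals.values())
hardest = max(totals, key=totals.get)  # classifier needing the most examples
```

With twelve classifiers the median is the mean of the sixth and seventh sorted totals, (73 + 81) / 2 = 77.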

Table 2 shows the results of using these trained classifiers
after training. Encouragingly, over the 110 trials
our behavior execution process attained a 100% success
rate after at most two tries. In addition, errors that led
to retries usually caused the robot to miss an appropriate
location on the device by a small distance.



Action                 Positive Ex.   Negative Ex.   Total
HRL Rocker On                    49             96     145
HRL Rocker Off                   47             94     141
Aware H. Rocker On               26             47      73
Aware H. Rocker Off              29             52      81
Ikea Drawer Open                 23             35      58
Ikea Drawer Close                23             39      62
Brown Drawer Open                21             62      83
Brown Drawer Close               25             46      71
Orange Switch On                 17             43      60
Orange Switch Off                20             31      51
Ornate Switch On                 38             66     104
Ornate Switch Off                40             76     116



5.1 Results 



Figure 10 shows the locations that the autonomously 
trained SVMs predict will be likely to lead to the suc- 
cess of their associated behaviors. These predictions are 
solely a function of the visual appearance of each loca- 
tion as represented by its image feature vector. These 
visualizations of the classifier output demonstrate that 
the classifiers identify locations relevant to their asso- 
ciated behaviors. For example, the robot autonomously 
discovers that opening a drawer requires grasping at 
the location of the drawer handle, while closing a drawer 
can be performed across the front surface of the drawer. 
The visualizations also show that different drawer han- 
dles can have distinct task-relevant properties. For ex- 
ample, the opening behavior works best when grasp- 
ing the middle of the silver handle, but can succeed by 
grasping the far ends of the brass handle. 

Due to the distribution for random sampling includ- 
ing some points on the lower handles for the white draw- 



6 Future Work 

There are a number of potential extensions to this work, 
and interesting issues left to consider. Although we have 
picked a particular active learning framework, other 
frameworks might perform better. In addition, currently 
we assume that each new device is completely new to 
the robot, but many devices of a particular class have 
visual similarities. Data from other devices might pro- 
vide a prior and reduce the training required. Similarly, 
the structure of successful locations might be shared 
across devices, even if they are visually distinct. For ex- 
ample, the front surfaces of drawers often being push- 
able, the centers of drawers often being pullable, and 
the centers of light switch panels often being switch- 
able could be useful information, even if aspects of their 
appearances change dramatically. 
Fig. 10 Each pair of images shows classification results of learned detectors just after convergence then on a new test image.
Green areas mark locations identified as leading to success of associated behaviors. Row 1: Detectors for a rocker switch in
our lab. Row 2: Detectors for a different rocker switch in the Aware Home. Row 3: Detectors for pushing and pulling a wooden
drawer. Row 4: Detectors for another dark wooden drawer. Row 5: Detectors for a regular light switch. Row 6: Detectors
for an ornate light switch.



7 Discussion and Conclusions 

In general, there are risks for a robot that learns in hu- 
man environments and an unrestrained learning system 
can get into situations that are dangerous to itself, to 
the environment, or to people. We address this issue by 
limiting the robot to using a few classes of behaviors 
in parts of the home that users have designated as safe 



for robot learning. Additionally, the behaviors move the 
robot's arm compliantly and use haptic sensing to de- 
cide when to stop moving. By learning in situ, a robot's 
data gathering activities do not have to stop after its 
training phase and can potentially continue for as long 
as the robot remains in service. 

Autonomous learning in human environments is a 
promising area of research that gives robots methods 
to cope with devices that they have not encountered
before and many forms of real-world variation. We have
presented methods that enable a mobile manipulator to 
autonomously learn to visually predict where manipula- 
tion attempts will succeed. As we discussed in the intro- 
duction, our work advances autonomous robot learning 
in three ways. First, our approach uses a robot's mobil- 
ity as an integral part of autonomous learning, which 
enables the robot to handle the significant task vari- 
ation introduced by its mobility. Second, our research 
demonstrates that by using active learning, a robot can 
autonomously learn visual classifiers solely from self- 
generated data in real-world scenarios with a tractable 
number of examples. Third, our research introduces 
complementary behaviors to address challenges associ- 
ated with autonomously learning tasks that change the 
state of the world. 